Text File | 1991-11-20 | 20.7 KB | 458 lines

CHAPTER 4

MACHINE TRANSLATION OF MATTHEW 26:1-35

This chapter will discuss the implementation and theoretical basis of the machine translation program developed in conjunction with this thesis. The program accepts as its source a derivative of the text found in the Semantic Structure Analysis (SSA) displays of the previous chapter. This choice of source text is explained in the following section. A sample of the text is included in Appendix C. The program also references a specialized lexicon referred to in this thesis as a semanticon. A portion of the semanticon is included in Appendix D. A sample of the program's translated output (in Spanish) is included in Appendix E. A diskette containing the translation program and all the files necessary to run it is bound into the back of this thesis. The complete translation of Matthew 26:1-35 is included on the diskette. The contents of the diskette are outlined in Appendix F. A listing of the program, which is written in the ICON programming language, is contained in Appendix G.

Theoretical Basis Of The Implementation

A fundamental principle underlying the design of the machine translation program is the notion that it is reasonable to put a good deal of manual analysis into a text that will be translated into a multitude of target languages. An example of such a text is the Bible, which still has not been translated into some 3500 minority languages. Some other suitable candidates for this type of treatment are the legislation of the European Community, and owner's manuals for various products.

A corollary to this first principle is the notion that any machine translation program will be more successful if the grammar of the source text is as limited as possible. In keeping with this corollary, the syntax of the program's input text has been greatly simplified, as set forth in the previous chapter about Semantic Structure Analysis.

A second fundamental principle is that the program attempts to translate meaning rather than just words. This is because word-based machine translations often produce the wrong meaning due to ambiguities in the source text. Another problem with word-based translation programs is that they become large, complex, and slow because they must employ various techniques to try to minimize the errors which spring from ambiguities in the source text. One of the greatest problems with word-based translations is that they assume that surface structures between languages are identical. This ignores the fact that every language has its own devices for skewing the basic relations between concepts and propositions in producing surface structures, and the rules for such skewing are very context-sensitive.

For these reasons, the program presented in this thesis attempts to translate meaning, and to that end, the analysis of the source text is based on the theory expounded in The Semantic Structure of Written Communication (SSWC) by Beekman, Callow, & Kopesec (1981). According to the SSWC, concepts/meanings come in four classes: things, events, attributes, and relations (1981:49). In their simplest forms things are represented by nouns, events by verbs, attributes by adjectives and adverbs, and relations by function words like conjunctions, sentence adverbs, and prepositions. A formidable problem for the translator presents itself when concepts are not represented in their simplest forms; this is called lexical skewing. For instance, in the sentence 'John gave Mary some help', the word 'help' is really an event. A simpler (i.e.
unskewed) way to express the same meaning would be, 'John helped Mary.'

A linguistic universal could be claimed here. That is, all languages allow unskewed forms of expression, but no language allows all possible skewed forms of a concept. While it is beyond the scope of this thesis to attempt to prove the validity of this linguistic universal, there is ample anecdotal evidence to support it. For instance, in Spanish it is impossible to use the word for 'grape' as an adjective. So in Spanish one would never talk about 'grape wine', but one could express the concept in unskewed form as vino de uvas 'wine from grapes'.

Another assumption underlying the implementation of this program is that the analysis of the source text will be done primarily by native speakers of the source language. Likewise, post-editing of the translated text will be performed primarily by native speakers of the target language. The role of any bilingual person involved in the translation process could be limited to that of consultant and translation checker. This approach has the obvious benefit of reducing the need for scarce, expensive bilingual translation specialists.

The text that was translated as a part of this thesis represents something of a special case in that the analysis of the original text was, for obvious reasons, not done by native speakers of Koine Greek. Nevertheless, it could certainly be argued that the process of analyzing the original text would have been greatly simplified if such speakers of Koine Greek were still available. It should also be pointed out that the translation program does not accept the original text as its source text, but rather an English source text which is derived from the semantic structure analysis of the original Greek text. The current lack of native speakers of Koine Greek is precisely what motivates the use of an English rather than a Greek source text as input to the program.

Finally, it is assumed that in its first draft a translation does not need to be perfect to be understandable. This is borne out by the experience of anyone who has found it necessary to communicate with a non-native speaker of his or her own language. Even though this speaker may have less than perfect control of the language, communication is often successful. Native speakers of a language seem to have a high degree of tolerance for imperfect grammar. The advantage of taking this position is that where imperfections in the grammar of the translated text are considered minor, they can simply be left to the post-editor to correct.

Implementation Details

In the analysis of the English source text included with the program, an attempt was made to eliminate lexical skewing to the fullest extent possible. It should be noted that this is not entirely necessary when translating between closely related languages, but it becomes critical when translating into minority languages which may lack abstract nouns for events like 'love' or 'forgiveness'.

As noted above, an attempt was also made to utilize a very limited syntax in the analysis of the source text. Ideally each sentence of the source text should consist of a subject, verb, objects, and possibly a relative clause. Passive voice was not permitted because it does not exist in all languages, nor does it always serve the same function. In an attempt to represent all concepts using words employed in their primary senses, figures of speech such as metaphors, idioms, euphemisms, and so on were spelled out. In many languages these would cause much confusion if translated literally. (In fact, figures of speech are simply a variation on the theme of lexical skewing.) Finally, conjunctions and sentence adverbs were used in a stylized manner (i.e. they always mean the same thing).

To facilitate translation of meanings rather than words, a system utilizing connecting underscores and subscripting digits was employed in the preparation of the source text. For instance, 'chief_priests1' is treated as a single concept, and thus contains a connecting underscore. Such underscores represent the native speaker's judgement of how the source language words should be grouped into concepts. The subscripting digit '1' is added to distinguish this concept from any others which might possibly be renderable by the same English words. The subscripting digits used are somewhat arbitrary, but in the case of verbs the digits 1 through 3 were used for first, second, and third person singular verbs, and the digits 4 through 6 were used for the plural forms. Thus 'know6' would mean 'they know'.

Forms such as 'chief_priests1' and 'know6' are considered to be arbitrary symbols for units of meaning. They could just as easily have been rendered as 'abc1' and 'xyz6', but this would have resulted in an input text that was unreadable. Nevertheless, the idea that these symbols are arbitrary is important. For example, 'chief_priests1' may be rendered fairly literally in one language (i.e. sacerdotes principales in Spanish), but in another language the translation might sound more like 'honored old men who perform ceremonial rites'. The arbitrary forms used to represent meanings are called semantic tags in the program.

Since the program is attempting to translate meanings rather than words, it uses an invention called a semanticon (see Appendix D) rather than a lexicon. Here is what an entry in the semanticon looks like:

                           |---- Morphological Tag
                           |
                           |     |----- Target
                           |     |      Language
 Semantic Tag -----|       |     |      Sense
                   |       |     |
                'feast1'  'n'  'la fiesta'

Each entry in the semanticon begins with a semantic tag as described above. The next field in each entry is a morphological tag.
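
The tag and entry conventions described above can be illustrated with a short sketch. It is written in Python rather than the ICON of the actual program, and the names split_tag, verb_person_number, parse_entry, and Entry are hypothetical; none of them is taken from the listing in Appendix G. The sketch assumes that a semantic tag is one or more words joined by underscores and followed by a subscript number, and that a semanticon entry consists of three single-quoted fields on one line, as in the example entry.

```python
import re
from typing import List, NamedTuple, Tuple

# Verb subscripts as described in the text: 1-3 are first, second, and third
# person singular; 4-6 are the corresponding plural forms.
PERSON_NUMBER = {
    1: ("first", "singular"),
    2: ("second", "singular"),
    3: ("third", "singular"),
    4: ("first", "plural"),
    5: ("second", "plural"),
    6: ("third", "plural"),
}


class Entry(NamedTuple):
    """One semanticon entry (hypothetical in-memory representation)."""
    semantic_tag: str
    morph_tag: str
    rendering: str


def split_tag(tag: str) -> Tuple[List[str], int]:
    """Split a semantic tag into its source words and its subscript."""
    match = re.fullmatch(r"(\D+)(\d+)", tag)
    if match is None:
        raise ValueError(f"not a semantic tag: {tag!r}")
    return match.group(1).split("_"), int(match.group(2))


def verb_person_number(tag: str) -> Tuple[str, str]:
    """Read person and number from a verb tag's subscript.

    Only meaningful for verbs; for other tags the digit is arbitrary.
    """
    _, digit = split_tag(tag)
    return PERSON_NUMBER[digit]


def parse_entry(line: str) -> Entry:
    """Parse an entry written as three single-quoted fields (assumed format)."""
    fields = re.findall(r"'([^']*)'", line)
    if len(fields) != 3:
        raise ValueError(f"expected three quoted fields: {line!r}")
    return Entry(*fields)
```

Under these assumptions, split_tag("chief_priests1") separates the words "chief" and "priests" from the subscript 1, and verb_person_number("know6") reports third person plural, matching the gloss 'they know'.
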
A morphological tag is basically a part of speech, but it can contain additional information such as person, number, gender, tense, and so on. The morphological tag refers to the target language rendering of the concept represented by the semantic tag. This target language rendering may not strictly match its morphological tag in the traditional sense. For instance, sacerdotes principales 'priests principal' is not a noun in the traditional sense, but a combination of a noun plus an adjective. However, it functions as a single unit, and for this reason the conglomerate is treated as a noun in the semanticon.

The next field in the semanticon entry is the target language rendering of the concept represented by the semantic tag. It generally contains a single target language word, but it may contain multiple words connected by underscores. If the morphological tag is 'n' for noun, the entry for the target language rendering consists of an article followed by one or more words connected by underscores which loosely represent a noun. If, in Spanish, the morphological tag is one of those for adjectives, the entry consists of four words: a masculine and a feminine singular adjective and a masculine and a feminine plural adjective.

The source language text to be translated (see Appendix C) contains braces. These braces are used to delimit portions of the text which should be translated as a unit. For instance, noun phrases and prepositional phrases are surrounded by braces, and the main clause is surrounded by braces unless it is the only clause in its source line. The program translates text surrounded by braces as units. For example, if a noun phrase is surrounded by braces, the program will never make the article of that noun phrase agree with a noun which is outside that noun phrase.

Program Operation

The program first opens all of its files, and then reads the entire semanticon into memory.
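
The brace convention described above lends itself to a small illustration. The sketch below is in Python rather than the ICON of the actual program, and the names parse_segments and translation_order are hypothetical. It assumes that braces may nest and that concepts are separated by whitespace. The sample line used with it is invented for illustration and is not taken from Appendix C.

```python
from typing import Iterator, List, Union

# A unit is a list whose items are concept tokens (str) or nested units.
Unit = List[Union[str, "Unit"]]


def parse_segments(text: str) -> Unit:
    """Turn a brace-delimited source line into nested lists."""
    tokens = text.replace("{", " { ").replace("}", " } ").split()
    stack: List[Unit] = [[]]
    for token in tokens:
        if token == "{":
            stack.append([])          # open a new unit
        elif token == "}":
            if len(stack) == 1:
                raise ValueError("unmatched closing brace")
            unit = stack.pop()        # close the unit and attach it
            stack[-1].append(unit)    # to the unit that encloses it
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unmatched opening brace")
    return stack[0]


def translation_order(unit: Unit) -> Iterator[List[str]]:
    """Yield each unit's own tokens, innermost units first."""
    for part in unit:
        if isinstance(part, list):
            yield from translation_order(part)
    yield [part for part in unit if isinstance(part, str)]


# Invented sample line, for illustration only.
tree = parse_segments("{chief_priests1} plot6 {against1 {Jesus1}}")
order = list(translation_order(tree))
```

Because each unit is handled only together with its own tokens, an operation such as article agreement in this sketch cannot reach a noun outside the unit's braces, which is the behavior the text describes.
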
(Some experienced programmers may cringe at the thought of reading the entire semanticon into memory, but memory has become a very inexpensive commodity, and its copious use greatly accelerates program execution.) Next, a sentence of untranslated source text is read into memory, and the sentence is placed into an ICON list structure. Each element of this list structure represents one concept (i.e. word or words) from the source sentence. (A description of list structures is outside the scope of this thesis, but use of this structure greatly reduces the programming burden that would result if sentences were represented as strings.) Next, each concept is referenced in the semanticon, and the information obtained from the semanticon is added to the list. At this point the structure which the program has created is analogous to a sentence in the source language with target language glosses beneath each word.

The program next segments the text based on the position of braces within the text. When a segment of text is located which contains no further sub-segments (delimited by braces), that segment is translated. Translation involves a number of processes including adjustments to word order, word agreement, capitalization, punctuation, and phonology. When all the segments of a line of text have been translated, they are assembled into a string and written to the output file. This process is repeated until all the input text has been translated.

Critique

From the discussion above it can be discerned that the program translates one sentence at a time. Thus it might seem that all discourse considerations (i.e. relationships between units larger than a sentence) have been ignored. However, this is not the case. It is true that because of the great similarity between the languages and cultures of English and Spanish speakers, the differences in discourse structure between the two languages are minimal.
Nevertheless, it can be argued that discourse considerations have not been completely ignored because the analysis performed on the text prior to translation produced sentence adverbs and conjunctions that are used in a stylized (i.e. consistent) manner. Thus the relationship of any clause introduced by one of these sentence adverbs or conjunctions to the preceding discourse should come through clearly in the translation.

On the other hand, there will be problems using this approach with languages which employ an oral style of story telling in which certain information is repeated several times. I am inclined to solve this problem by making adaptations to the source text rather than to the program because the majority of the world's languages will not require this accommodation, and the ones which do will undoubtedly differ in their requirements.

Another discourse consideration which deserves attention is that of pronominal reference. An example of the problem is, 'The disciples prepared the passover meal. Later they ate it.' When rendering the second sentence into Spanish, the translation of 'it' would need to be the feminine 'la' to make it agree in gender with 'comida', the translation of the word for meal. However, in another language the word for meal might be masculine or neuter in gender. In the current version of the translation program this issue has been deliberately ignored, but only because the current version of the program is intended primarily to prove the feasibility of translating fixed texts into multiple languages by means of a computer program. Dealing with the problem of participant reference increases the size and complexity of not only the program but the source text as well. This would make the program harder to understand, and the source text harder to read. Pronominal reference will be dealt with in the next version of the program.

The next version of the program will also employ markers in the source text for semantic roles like agent and patient. This will make it possible to translate into languages that are ergative-absolutive. Such languages use coding schemes which are entirely different from English. For instance, in English the agent of any active sentence is normally coded, at the surface level, in the nominative (i.e. subject) case. However, in an ergative-absolutive language the agent may be realized in the ergative case at the surface level if it is the subject of a transitive verb, but it may be realized in the absolutive case if it is the subject of an intransitive verb (one which doesn't take an object).

Implementing A New Target Language

To make the program translate into some other language such as French, it would first be necessary to change the semanticon to contain French renderings for the semantic tags. (The semanticon can be changed with a text editor.) Note that French requires explicit subject pronouns. For instance, the entry for 'know6' would need to contain two words meaning 'they know' rather than the single Spanish word saben.

Also, for some languages (not necessarily for French) it may be necessary that some concepts be expressed more specifically than is required in English. For instance, it may not be possible to simply talk about a 'brother'. It may be necessary to specify 'older brother' or 'younger brother'. In such situations it will be necessary to edit the source text to include semantic tags ('brother1' and 'brother2') which specify the more specific concepts. Fortunately, this does not render the enhanced source text unusable for languages which do not require this additional information. In such cases semantic tags like 'brother1' and 'brother2' can simply be translated into the target language equivalent of 'brother'.
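
The effect of swapping semanticons can be sketched as follows. The sketch is in Python rather than ICON, and everything in it other than the tags 'know6', 'brother1', and 'brother2' and the Spanish word saben is an assumption made for illustration: the dictionary layout, the function name render, the verb tag 'v', the renderings 'el hermano', 'le frère', and 'ils_savent', and the use of an underscore to join the two French words (following the statement that multi-word renderings are connected by underscores).

```python
# Hypothetical in-memory semanticons: semantic tag -> (morphological tag, rendering).
SPANISH = {
    "know6": ("v", "saben"),            # one word; the subject pronoun is implicit
    "brother1": ("n", "el hermano"),    # both brother tags collapse to the
    "brother2": ("n", "el hermano"),    # same rendering in Spanish
}

FRENCH = {
    "know6": ("v", "ils_savent"),       # French needs an explicit subject pronoun
    "brother1": ("n", "le frère"),
    "brother2": ("n", "le frère"),
}


def render(tag: str, semanticon: dict) -> str:
    """Look up a semantic tag and return its target language words."""
    _morph_tag, rendering = semanticon[tag]
    return rendering.replace("_", " ")  # underscores only join words in the entry


spanish_know = render("know6", SPANISH)
french_know = render("know6", FRENCH)
```

The same source text serves both languages here. Only the table changes, and a language that does distinguish older from younger brothers would simply give 'brother1' and 'brother2' different renderings.
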

After this is done, it would still be necessary to make some program modifications, but they should not be too formidable for a closely related language like French. First of all, the program has some global variables containing Spanish articles. These would need to be changed to contain their French counterparts, but it would probably not be necessary to change the identifier names of these global variables. Second, it would be necessary to modify the procedure contract(), because the rules for contraction are different in French. Likewise, the procedure phono_adj(), which makes phonological adjustments (like 'a house' but 'an hour'), would have to be modified to follow French rules. Finally, the procedures which correct word order (order() and the procedures it calls) would also need to be modified to accommodate French word order. None of the required modifications should be very time-consuming, since the entire program was written for Spanish in just fifteen days.

Conclusion

I have attempted to show some of the theoretical basis for producing machine translations, and to demonstrate the feasibility of translating fixed texts into multiple target languages using a computer program as a translation aid. I have demonstrated, via the translated text in Appendix E, that such fixed texts can be translated with a high degree of quality if the source text is adequately pre-analyzed. I have also asserted that the pre-analysis can be performed by persons who are native speakers of only the source language (i.e. English), and who may have no knowledge of any of the intended target languages. Likewise, I have contended that any required post-editing can be done by persons who are fluent in only the target language, and the role of any bilingual specialists could be limited to that of consultants and translation checkers.
Considering all of these points, it should be possible to produce translations of fixed texts into multiple languages using a machine translation program as a translation aid and to do so more quickly, more consistently, and at a lower cost than by traditional methods.